Conversation

@AnilSorathiya
Contributor

@AnilSorathiya AnilSorathiya commented Sep 15, 2025

Pull Request Description

What and why?

  • Integrate Deepeval scorers into ValidMind as first-class scorers under a dedicated deepeval namespace, enabling evaluation of LLM outputs with standardized metrics.
  • Add Deepeval-based LLM scorers (e.g., Hallucination, Contextual Precision/Recall, Summarization, Task Completion) and a supporting demo notebook for end-to-end usage (a rough sketch of the wrapping pattern follows this list).
  • Maintenance: update .gitignore for *.deepeval artifacts; remove deprecated/duplicate tests (e.g., GEval); improve plots (e.g., boxplot) and examples.
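
For context, here is a rough sketch of the pattern these scorer wrappers follow, based on the DeepEval calls shown later in this thread; it is illustrative only, not the actual ValidMind implementation, and the function name is made up:

    from deepeval.metrics import SummarizationMetric
    from deepeval.test_case import LLMTestCase

    def summarization_score(source_text: str, summary: str, threshold: float = 0.5) -> float:
        # Build a DeepEval test case from one dataset row, run the metric,
        # and return the numeric score that ValidMind stores as a column.
        test_case = LLMTestCase(input=source_text, actual_output=summary)
        metric = SummarizationMetric(threshold=threshold)
        metric.measure(test_case)
        return metric.score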

How to test

  • Notebook validation:
    • Run notebooks/code_sharing/deepeval_integration_demo.ipynb end-to-end; verify Deepeval scorers run, log results, and produce expected figures/tables.
    • Run notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb with validmind/datasets/llm/agent_dataset.py to exercise Task Completion and related scorers.
  • Scorer/runtime tests:
    • Run pytest for scorer interfaces and decorator behavior:
      • pytest -q tests/test_scorer_decorator.py
      • pytest -q tests/test_unit_tests.py and any added tests tagged for Deepeval/scorers.

What needs special review?

Dependencies, breaking changes, and deployment notes

Release notes

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@AnilSorathiya AnilSorathiya added the enhancement (New feature or request) label Oct 7, 2025
@AnilSorathiya AnilSorathiya marked this pull request as ready for review October 7, 2025 16:04
@cachafla
Contributor

Some feedback for notebooks/code_samples/agents/langgraph_agent_simple_banking_demo.ipynb:

Instead of "We'll use our comprehensive banking test dataset to evaluate our agent's performance across different banking scenarios." I'd suggest:

We'll use a sample test dataset to evaluate our agent's performance across different banking scenarios.

For validmind.scorer.llm.deepeval.TaskCompletion, how does a user control the default verbosity of DeepEval tests? They print a lot of things.


For "Let's add box plot for task completion score." there should be an explanation that the previous test has added a new column `TaskCompletion_score` as part of `assign_scores` and that this is what we're going to use for the box plot. We should also explain that these columns are added because of how our scorer return values are processed.

@cachafla
Contributor

For notebooks/code_sharing/deepeval_integration_demo.ipynb:

Does the %pip install -q validmind require [all] like the other notebook?

For "Compute metrics using ValidMind scorer interface":

There should be a short explanation clarifying how the test knows where the input and output columns are declared. That way the end user will know how the input dataset is being used.

Alternatively we can also pass input_column and actual_output_column explicitly so we self document how the scorers work, even though we match the default argument values.

I tend to think that this could apply for all uses of scorers in demo notebooks actually 🤔.

Towards the end of the notebook, on the section Integrate with ValidMind:

This code is not needed because vm is already initialized at the beginning:

    # Initialize ValidMind
    vm.init()
    print("ValidMind initialized")

At the end of the notebook there's a cell with this text: FIXED VERSION. What is that?


The last cell runs each of the custom_metrics with:

result = metric.measure(test_case)

How does that integrate with VM? Via scorers or tests? As a user I wouldn't know how to bring those results from GEval tests to a ValidMind document.

@cachafla cachafla left a comment

Looking good 🙌 apologies for the delay reviewing this.

My suggestion would be to put the golden datasets + GEval in separate notebooks. This notebook has everything we need, but it can feel heavy and like it's trying to do multiple things at the same time.

Specifically, I'd recommend:

  • Leaving this notebook as a demonstration of integration with DeepEval LLMTestCase and SummarizationMetric
  • Another notebook that demonstrates how to use LLMAgentDataset with the Golden dataset from DeepEval
  • Another notebook that demonstrates how to use GEval with VM scorers and/or VM tests

Specifically for Golden, I feel like we should define the actual use case we want to demonstrate here. GEval has a clearer objective, but the Golden examples with the mock LLM usage feel a bit out of place.

To expedite merging this PR, we can probably update the notebook to not include the golden datasets + GEval and come back to that in a follow-up PR.

Thoughts?

@juanmleng
Contributor

Looking good 🙌 apologies for the delay reviewing this.

My suggestion would be to put the golden datasets + GEval in separate notebooks. This notebook has everything we need, but it can feel heavy and like it's trying to do multiple things at the same time.

Specifically, I'd recommend:

  • Leaving this notebook as a demonstration of integration with DeepEval LLMTestCase and SummarizationMetric
  • Another notebook that demonstrates how to use LLMAgentDataset with the Golden dataset from DeepEval
  • Another notebook that demonstrates how to use GEval with VM scorers and/or VM tests

Specifically for Golden, I feel like we should define the actual use case we want to demonstrate here. GEval has a clearer objective, but the Golden examples with the mock LLM usage feel a bit out of place.

To expedite merging this PR, we can probably update the notebook to not include the golden datasets + GEval and come back to that in a follow-up PR.

Thoughts?

These comments look very sensible to me. Totally agree!

@AnilSorathiya
Contributor Author

For validmind.scorer.llm.deepeval.TaskCompletion, how does a user control the default verbosity of DeepEval tests? They print a lot of things.

This is a known issue in DeepEval.
Generally you can pass verbose_mode=False, but quite a few tests ignore it. I am passing:

    metric = TaskCompletionMetric(
        threshold=threshold,
        model=model,
        include_reason=True,
        strict_mode=strict_mode,
        verbose_mode=False,
    )

These tests ignore verbose_mode:
Answer Relevancy
Faithfulness
Contextual Precision
Contextual Recall
Contextual Relevancy
...
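
A possible workaround for those metrics (not part of this PR, just a generic Python approach) is to capture stdout/stderr around the measure call:

    import contextlib
    import io

    def measure_quietly(metric, test_case):
        # Swallow anything the metric prints while it runs; output written
        # directly to the terminal (e.g., some progress bars) may still get through.
        with contextlib.redirect_stdout(io.StringIO()), contextlib.redirect_stderr(io.StringIO()):
            metric.measure(test_case)
        return metric.score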

@AnilSorathiya
Contributor Author

For "Let's add box plot for task completion score." there should be an explanation that the previous test has added a new column `TaskCompletion_score` as part of `assign_scores` and that this is what we're going to use for the box plot. We should also explain that these columns are added because of how our scorer return values are processed.

Added a separate section in the notebook:

## Scorers in ValidMind

Scorers are evaluation metrics that analyze model outputs and store their results in the dataset. When using `assign_scores()` (a short example follows this list):

- Each scorer adds a new column to the dataset with the format `{scorer_name}_{metric_name}`
- The column contains the numeric score (typically 0-1) for each example
- Multiple scorers can be run on the same dataset, each adding their own column
- Scores are persisted in the dataset for later analysis and visualization
- Common scorer patterns include:
  - Model performance metrics (accuracy, F1, etc)
  - Output quality metrics (relevance, faithfulness)
  - Task-specific metrics (completion, correctness)
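
For example, a minimal sketch of the box-plot step (assuming the ValidMind dataset exposes its underlying DataFrame as `.df`; `vm_test_ds` is a placeholder name for the dataset used in the notebook):

    import matplotlib.pyplot as plt

    # assign_scores() with the TaskCompletion scorer has already added a
    # TaskCompletion_score column, per the naming scheme above.
    scores = vm_test_ds.df["TaskCompletion_score"].dropna()

    plt.boxplot(scores)
    plt.title("Task completion score distribution")
    plt.ylabel("score")
    plt.show()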

@AnilSorathiya
Contributor Author

There should be a short explanation clarifying how the test knows where the input and output columns are declared. That way the end user will know how the input dataset is being used.

Alternatively we can also pass input_column and actual_output_column explicitly so we self document how the scorers work, even though we match the default argument values.

I tend to think that this could apply for all uses of scorers in demo notebooks actually 🤔.
Makes sense, thanks.
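
For reference, the explicit form might look roughly like this; the parameter names `input_column` and `actual_output_column` come from the review comment, and the exact `assign_scores()` signature may differ:

    # Hypothetical, self-documenting call: the column arguments match the
    # defaults but make explicit which dataset columns the scorer reads.
    vm_test_ds.assign_scores(
        "validmind.scorer.llm.deepeval.TaskCompletion",
        input_column="input",
        actual_output_column="actual_output",
    )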

@AnilSorathiya
Contributor Author

The last cell runs each of the custom_metrics with:

result = metric.measure(test_case)

How does that integrate with VM? Via scorers or tests? As a user I wouldn't know how to bring those results from GEval tests to a ValidMind document.

This section has been removed from the notebook. I will create a separate notebook in PR #434.

@AnilSorathiya
Contributor Author

Looking good 🙌 apologies for the delay reviewing this.

My suggestion would be to put the golden datasets + GEval in separate notebooks. This notebook has everything we need, but it can feel heavy and like it's trying to do multiple things at the same time.

Specifically, I'd recommend:

  • Leaving this notebook as a demonstration of integration with DeepEval LLMTestCase and SummarizationMetric
  • Another notebook that demonstrates how to use LLMAgentDataset with the Golden dataset from DeepEval
  • Another notebook that demonstrates how to use GEval with VM scorers and/or VM tests

Specifically for Golden, I feel like we should define the actual use case we want to demonstrate here. GEval has a clearer objective, but the Golden examples with the mock LLM usage feel a bit out of place.

To expedite merging this PR, we can probably update the notebook to not include the golden datasets + GEval and come back to that in a follow-up PR.

Thoughts?

Yes, agree.

  • Working on GEval in a separate PR, where I will create a separate notebook: [SC 12707] Add G-eval test in lib #434
  • For the Golden dataset we will have a separate notebook as well, once we have a clear use case in mind.

@juanmleng juanmleng left a comment

LGTM! Great notebook! Just left a small comment.

@github-actions
Contributor

PR Summary

This PR introduces significant enhancements and bugfixes to improve the integration between ValidMind and DeepEval. Key functional changes include:

  • Modifications to the .gitignore to cover additional file types such as *.qmd and DeepEval-related files.
  • Updates to test datasets and notebooks: Several notebooks have been updated to demonstrate new use cases covering various LLM test scenarios. This includes sample test cases for banking risk evaluation, retrieval-augmented generation (RAG) systems, and agent evaluations. Test cases now leverage DeepEval metrics such as TaskCompletion, Faithfulness, Summarization, Bias, Contextual Relevancy, Contextual Precision, and Contextual Recall.
  • Enhanced handling of tool calls in agent evaluations: The logic to extract tool calls, responses, and to integrate these into the TaskCompletion metric was refactored. This ensures that both dictionary and object formats are supported and that the tool response data is correctly processed.
  • Updates to dataset conversion: In the LLMAgentDataset conversion functions, the serialization of tool call fields has been simplified by removing redundant serialization steps.
  • Numerous improvements to code comments and documentation within the notebooks and scorer modules, making the evaluation flows clearer for users.
  • A minor version bump (2.10.1) in configuration files, which is not part of the functional changes summarized above.

Overall, these changes aim to streamline the testing infrastructure for LLMs, improve metric evaluations, and provide detailed insights into agent behavior through advanced scoring metrics from DeepEval.
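
For the tool-call extraction point above, an illustrative sketch of normalizing both dict- and object-shaped tool calls into DeepEval's ToolCall (assuming ToolCall accepts name and input_parameters; the PR's actual helper may differ):

    from deepeval.test_case import ToolCall

    def to_tool_call(raw) -> ToolCall:
        # Accept either a dict-style tool call ({"name": ..., "args": {...}})
        # or an object with .name / .args attributes, and normalize it.
        if isinstance(raw, dict):
            name = raw.get("name", "")
            args = raw.get("args") or raw.get("arguments") or {}
        else:
            name = getattr(raw, "name", "")
            args = getattr(raw, "args", {}) or {}
        return ToolCall(name=name, input_parameters=args)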

Test Suggestions

  • Run the updated notebooks to ensure that all new DeepEval-based test cases compute scores correctly for various metrics (e.g., TaskCompletion, Summarization).
  • Validate the extraction logic for tool calls by providing both dictionary and object formatted messages and confirming correct ToolCall instantiation.
  • Execute unit tests for each new scorer module (Bias, ContextualPrecision, ContextualRecall, ContextualRelevancy, Faithfulness, Hallucination, and TaskCompletion) to verify they handle input datasets as expected.
  • Manually inspect the output of the modified LLMAgentDataset to verify that the 'tools_called' field is correctly populated without unnecessary serialization steps.

@AnilSorathiya AnilSorathiya merged commit 34ba898 into main Oct 17, 2025
17 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-12254/add-new-deepeval-tests-in-lib branch October 17, 2025 11:22